[HUDI-8631] Support of hoodie.populate.meta.fields for Flink append mode #12516

Conversation
```diff
@@ -90,6 +91,7 @@ public HoodieRowDataCreateHandle(HoodieTable table, HoodieWriteConfig writeConfi
     this.fileId = fileId;
     this.newRecordLocation = new HoodieRecordLocation(instantTime, fileId);
     this.preserveHoodieMetadata = preserveHoodieMetadata;
+    this.skipMetadataWrite = skipMetadataWrite;
```
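To illustrate the intent of the new flag, here is a minimal standalone sketch of how a create handle could gate the write path on it. The class and method names are illustrative only, not the actual `HoodieRowDataCreateHandle` internals:

```java
// Hypothetical sketch (not Hudi code): a handle that drops meta-field
// generation entirely when skipMetadataWrite is set.
public class RowCreateHandleSketch {
  private final boolean skipMetadataWrite;

  public RowCreateHandleSketch(boolean skipMetadataWrite) {
    this.skipMetadataWrite = skipMetadataWrite;
  }

  /** Returns the columns that would be written out for one row. */
  public String[] prepareRow(String[] dataColumns, String[] metaColumns) {
    if (skipMetadataWrite) {
      return dataColumns;              // append mode without meta fields
    }
    // Regular path: meta fields are prepended before the data columns.
    String[] out = new String[metaColumns.length + dataColumns.length];
    System.arraycopy(metaColumns, 0, out, 0, metaColumns.length);
    System.arraycopy(dataColumns, 0, out, metaColumns.length, dataColumns.length);
    return out;
  }
}
```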
Doesn't the flag `preserveHoodieMetadata` already control this behavior? There is another PR raised by @usberkeley fixing all the scenarios BTW: #12404
Yes, I saw #12404 but was confused by the ticket name, which mentions the Flink table config `hoodie.populate.meta.fields`, and I didn't find any changes in `hudi-flink-client` or `hudi-flink-datasource` that would change the current behavior. I described my point of view in #12404 (comment). And to support my comment, I've created this MR, which shows the lack of support for `hoodie.populate.meta.fields` in Flink.
`preserveHoodieMetadata` is actually used in the code as an indicator of whether to take metadata from the row data or to generate it by calling the corresponding methods, so the naming is a little confusing. I believe the possible values of `preserveHoodieMetadata` can be described by this schema:

Looks like `preserveHoodieMetadata` can be true only for the clustering operator.
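The interplay of the two flags can be sketched as a small truth table in plain Java. This is a hypothetical helper for illustration, not the actual Hudi API: `preserveHoodieMetadata` says the incoming row already carries the meta fields, while `populateMetaFields` (i.e. `hoodie.populate.meta.fields`) says whether meta fields should exist in the output at all:

```java
// Illustrative sketch (not Hudi code) of how the two flags could combine.
public class MetaFieldsDecision {

  public enum Action { COPY_FROM_ROW, GENERATE, SKIP }

  public static Action decide(boolean populateMetaFields, boolean preserveHoodieMetadata) {
    if (!populateMetaFields) {
      return Action.SKIP;              // append-only path: no meta columns written
    }
    return preserveHoodieMetadata
        ? Action.COPY_FROM_ROW         // e.g. clustering: rewrite existing metadata
        : Action.GENERATE;             // regular write: create metadata on the fly
  }
}
```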
It looks like the flag `preserveHoodieMetadata` indicates whether the source row already includes the metadata fields. For a table service like clustering, this should be true by default (because clustering is just a rewrite); for a regular write, the metadata fields should be generated on the fly. Let's check in which cases the option `hoodie.populate.meta.fields` could be false.
The user sets a value for the `hoodie.populate.meta.fields` option, which is `true` by default. The description of this config mentions "append only/immutable data" as the use case:

hudi/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java, lines 261 to 265 in 9da3221:
```java
public static final ConfigProperty<Boolean> POPULATE_META_FIELDS = ConfigProperty
    .key("hoodie.populate.meta.fields")
    .defaultValue(true)
    .withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
        + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");
```
For this reason, in this MR I support `hoodie.populate.meta.fields` in Flink only for append mode. For a quick check I use SQL queries like the following, which trigger append mode:
```sql
CREATE TABLE hudi_debug (
  id INT,
  part INT,
  `desc` STRING,  -- `desc` is a reserved keyword in Flink SQL and must be quoted
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = '...',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'insert',
  'hoodie.populate.meta.fields' = 'false'
);

INSERT INTO hudi_debug VALUES
  (1, 100, 'aaa'),
  (2, 200, 'bbb');
```
Expected results: no exceptions are thrown during

```sql
SELECT * FROM hudi_debug;
```

and the corresponding parquet files in HDFS don't contain metadata columns.
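For reference, the metadata columns in question are Hudi's five standard meta fields, whose names are fixed. A quick standalone check over the column names read back from a parquet footer could be sketched like this (illustrative helper, not the integration test from this PR):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: verify that none of Hudi's five standard meta columns
// appear among the column names of a written file.
public class MetaColumnCheck {

  static final List<String> HOODIE_META_COLUMNS = Arrays.asList(
      "_hoodie_commit_time", "_hoodie_commit_seqno",
      "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name");

  public static boolean hasNoMetaColumns(List<String> parquetColumns) {
    return parquetColumns.stream().noneMatch(HOODIE_META_COLUMNS::contains);
  }
}
```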
I've also found that we can write a MOR table in upsert mode without metadata; the call stack in this case includes `HoodieAppendHandle`. But we can't read the resulting MOR table with Flink afterwards due to an exception thrown during `SELECT * FROM hudi_debug;`. I've created a separate bug for reading MOR tables without meta columns, HUDI-8785, and will fix it in a separate MR.
Can we at least add an integration test in `ITTestHoodieDataSource`?
@danny0405 I've added `ITTestHoodieDataSource::testWriteWithoutMetaColumns`. But for a proper check, it would be great to write the data with Flink and then read it with Spark, because in Spark `SELECT * FROM table;` returns all columns, including the metadata ones. That would be a really useful check of engine interoperability; I've created a corresponding task, HUDI-8788.

I will restart Azure CI, but we have a problem with the timeout:

@hudi-bot run azure
Change Logs

Currently, `hoodie.populate.meta.fields` is not supported in Flink. This config is intended for append mode, so this MR adds support for that case.

Before (master, 3cb874f):

After:

The TPC-H lineitem table is used for profiling. Writes in append mode are 14% faster for Flink when `hoodie.populate.meta.fields = false`. The next possible optimization for this scenario is to reconsider whether we really need Bloom filters in this use case, because they cost 16% of CPU after the optimization in this MR:
Impact

No

Risk level (write none, low medium or high below)

Low

Documentation Update

No need. There is no mention on the "All Configurations" page that `hoodie.populate.meta.fields` is not supported for Flink.

Contributor's checklist